Welcome to the R project of Daniel Teixeira and Christian Bester. In this project, we will slice and dice the avocado dataset to see if we can get some meaningful insights from the much beloved fruit of the millenials!
The ultimate purpose of this project is to attempt some hypothesis testing and (hopefully) confirmation, using the statistical programming software, R.
In this section, we will set up the R workspace before we proceed with our statistical analysis.
rm(list = ls())
The first step is to clear the R environment. The whole point of working with R is reproducibility, which means that you shouldn’t be working with previously saved data. So we begin by clearing the R environment.
pkgs <- c("Hmisc", "ggfortify", "kableExtra", "lubridate", "scales", "sjPlot", "tidyverse")
sapply(pkgs, function(x) if(!x %in% installed.packages()) install.packages(x, repos = "http://cran.us.r-project.org"))
sapply(pkgs, library, character.only = TRUE)
As this is a reproducible research that will be shared with other researchers, the code needs to check if the packages used here is installed on the users PC before loading the package, otherwise an error will be thrown and the document will not be created.
This test is done using an sapply, which essentially loops over the vector of packages and checks if it is not listed in the installed packages. If this is true, the package is installed.
We then load the packages required into R, first by saving the list of packages we need to a vector. and then using sapply to load the packages in one line, rather than multiple calls to library. This is very useful if one has a multitude of packages.
options("scipen"=999)
This code removes scientific notation for up to 999 digits.
Exploratory Data Analysis (EDA) is very important in data analytics. It is through this process that one is able to formulate various hypotheses that can then be formally tested.
This section is therefore the longest portion of this document, which will go through various stages from the initial data imports, data cleaning and manipulation to data visualisation and hypothesis formation.
We read in the data using the read.csv function from the utils package, which is installed with R and is loaded when R is launched. We set check.names to FALSE to prevent periods being placed where there is a space character. We save the data to pdata, where p stands for project. Hence, the name of the data is project data. Always try to have informative names for objects when programming.
We use the names function, also from utils, to see the column names of the dataset, so that we can know the variable names of the dataset.
pdata <- read.csv("avocado.csv", check.names = FALSE)
names(pdata)
The output of the preceding command is as follows:
[1] "" "Date" "AveragePrice" "Total Volume" "4046"
[6] "4225" "4770" "Total Bags" "Small Bags" "Large Bags"
[11] "XLarge Bags" "type" "year" "region"
These are the variables collected in the dataset. Immediately, we can see that there is an issue with the first column, the column name is blank. So we have no clue of what is in it.
The snapshot of the preceding dataset is shown here using the glimpse function:
glimpse(pdata)
Rows: 18,249
Columns: 14
$ `` <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ Date <chr> "2015-12-27", "2015-12-20", "2015-12-13", "2015-12-0...
$ AveragePrice <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02...
$ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, 51039.60, 5...
$ `4046` <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48, 1184.27, 1...
$ `4225` <dbl> 54454.85, 44638.81, 109149.67, 71976.41, 43838.39, 4...
$ `4770` <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61, 93.26, 80...
$ `Total Bags` <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.95, 6683.91...
$ `Small Bags` <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.26, 6556.47...
$ `Large Bags` <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127.44, 122.05...
$ `XLarge Bags` <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
$ type <chr> "conventional", "conventional", "conventional", "con...
$ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015...
$ region <chr> "Albany", "Albany", "Albany", "Albany", "Albany", "A...
One might be tempted into believing that the first column in simply a row or observation count beginning from zero. This might not be the case. To test this, we reverse the dataframe and then use glimpse on the reversed data. What this does is allow us to see the observations at the end of the data through the glimpse function, which otherwise only shows data of the first few observations.
pdata %>% map_df(rev) %>% glimpse()
Rows: 18,249
Columns: 14
$ `` <int> 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 11, 10, 9, 8, ...
$ Date <chr> "2018-01-07", "2018-01-14", "2018-01-21", "2018-01-2...
$ AveragePrice <dbl> 1.62, 1.93, 1.87, 1.71, 1.63, 1.57, 1.56, 1.57, 1.54...
$ `Total Volume` <dbl> 17489.58, 16205.22, 13766.76, 13888.04, 17074.83, 15...
$ `4046` <dbl> 2894.77, 1527.63, 1191.92, 1191.70, 2046.96, 1924.28...
$ `4225` <dbl> 2356.13, 2981.04, 2452.79, 3431.50, 1529.20, 1368.32...
$ `4770` <dbl> 224.53, 727.01, 727.94, 0.00, 0.00, 0.00, 0.00, 0.00...
$ `Total Bags` <dbl> 12014.15, 10969.54, 9394.11, 9264.84, 13498.67, 1269...
$ `Small Bags` <dbl> 11988.14, 10919.54, 9351.80, 8940.04, 13066.82, 1243...
$ `Large Bags` <dbl> 26.01, 50.00, 42.31, 324.80, 431.85, 256.22, 223.18,...
$ `XLarge Bags` <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
$ type <chr> "organic", "organic", "organic", "organic", "organic...
$ year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
$ region <chr> "WestTexNewMexico", "WestTexNewMexico", "WestTexNewM...
So, the nameless column is not equal to the number of observations in the data. We will give it an arbitrary name so that it can be selected in operations.
colnames(pdata)[1] <- "code"
Descriptive statistics is a method of summarizing a dataset quantitatively. These summaries can be simple quantitative statements about the data or a visual representation sufficient enough to be part of the initial description about the dataset.
To get a set of summary statistics, one can use the base R function, summary. However, this function is better used for numerical data. The avocado dataset has numerical and categorical data. The Hmisc package has a function called describe, which provides extra level of summary statistics. In addition to what is given by summary, such as missing values, distribution of numerical variables, and distinct values of categorical variables.
d <- describe(pdata)
We don’t print the output to the console so that it is formatted as html, using the html function, also from Hmisc, because the output of this document is html file. We specifically instruct R to import the function from Hmisc to avoid potential namespace conflicts with the html function from the htmltools package.
Hmisc::html(describe(pdata), size=85, tabular=TRUE,
greek=TRUE, scroll=TRUE, rows=25, cols=125,
border = 2)
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18249 | 0 | 53 | 1 | 24.23 | 17.85 | 2 | 4 | 10 | 24 | 38 | 46 | 49 |
| n | missing | distinct |
|---|---|---|
| 18249 | 0 | 169 |
| lowest : | 2015-01-04 | 2015-01-11 | 2015-01-18 | 2015-01-25 | 2015-02-01 |
| highest: | 2018-02-25 | 2018-03-04 | 2018-03-11 | 2018-03-18 | 2018-03-25 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18249 | 0 | 259 | 1 | 1.406 | 0.4512 | 0.83 | 0.93 | 1.10 | 1.37 | 1.66 | 1.93 | 2.11 |
n missing distinct Info Mean Gmd .05 .10 .25
18249 0 18237 1 850644 1450905 2372 3897 10839
.50 .75 .90 .95
107377 432962 1387046 3716315
| lowest : | 84.56 | 379.82 | 385.55 | 419.98 | 472.82 |
| highest: | 46324529.70 | 47293921.60 | 52288697.89 | 61034457.10 | 62505646.52 |
n missing distinct Info Mean Gmd .05
18249 0 17702 1 293008 524489 19.60
.10 .25 .50 .75 .90 .95
94.28 854.07 8645.30 111020.20 538385.18 1263359.68
| lowest : | 0.00 | 1.00 | 1.13 | 1.19 | 1.20 |
| highest: | 17076650.82 | 17787611.93 | 18933038.04 | 21620180.90 | 22743616.17 |
n missing distinct Info Mean Gmd .05 .10
18249 0 18103 1 295155 509218 103.6 367.5
.25 .50 .75 .90 .95
3008.8 29061.0 150206.9 500784.6 1303657.7
| lowest : | 0.00 | 1.26 | 1.28 | 1.30 | 1.31 |
| highest: | 17896391.60 | 18956479.74 | 20328161.55 | 20445501.03 | 20470572.61 |
n missing distinct Info Mean Gmd .05 .10 .25
18249 0 12071 0.973 22840 41861 0 0 0
.50 .75 .90 .95
185 6243 31492 106157
| lowest : | 0.00 | 0.83 | 1.00 | 1.01 | 1.09 |
| highest: | 1811090.71 | 1880231.38 | 1896149.50 | 1993645.36 | 2546439.11 |
n missing distinct Info Mean Gmd .05 .10
18249 0 18097 1 239639 405113 628.9 1299.2
.25 .50 .75 .90 .95
5088.6 39743.8 110783.4 442141.9 1005478.9
| lowest : | 0.00 | 3.09 | 3.11 | 3.19 | 3.33 |
| highest: | 15804696.31 | 15972492.07 | 16298296.29 | 16394524.11 | 19373134.37 |
n missing distinct Info Mean Gmd .05 .10 .25
18249 0 17321 1 182195 309915 256.7 583.1 2849.4
.50 .75 .90 .95
26362.8 83337.7 354266.9 768147.2
| lowest : | 0.00 | 2.52 | 2.57 | 2.73 | 2.79 |
| highest: | 11392828.89 | 11712807.19 | 12540327.19 | 12567155.58 | 13384586.80 |
n missing distinct Info Mean Gmd .05 .10 .25
18249 0 15082 0.998 54338 96628 0.0 0.0 127.5
.50 .75 .90 .95
2647.7 22029.2 94295.3 195699.8
| lowest : | 0.00 | 0.97 | 1.30 | 1.33 | 1.38 |
| highest: | 3988101.74 | 4023485.04 | 4081397.72 | 4324231.19 | 5719096.61 |
n missing distinct Info Mean Gmd .05 .10 .25
18249 0 5588 0.712 3106 5871 0.0 0.0 0.0
.50 .75 .90 .95
0.0 132.5 3688.9 12058.5
| lowest : | 0.00 | 1.00 | 1.11 | 1.26 | 1.30 |
| highest: | 377661.06 | 387400.22 | 390478.73 | 454343.65 | 551693.65 |
| n | missing | distinct |
|---|---|---|
| 18249 | 0 | 2 |
Value conventional organic Frequency 9126 9123 Proportion 0.5 0.5
| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 18249 | 0 | 4 | 0.911 | 2016 | 1.031 |
Value 2015 2016 2017 2018 Frequency 5615 5616 5722 1296 Proportion 0.308 0.308 0.314 0.071
| n | missing | distinct |
|---|---|---|
| 18249 | 0 | 54 |
| lowest : | Albany | Atlanta | BaltimoreWashington | Boise | Boston |
| highest: | Syracuse | Tampa | TotalUS | West | WestTexNewMexico |
There are a few more functions in R that are useful for summary analysis, which can be applied to the variables individually. To get to know the standard deviation of the dataset, we can use the sd function that can be applied to the numerical variables only. To do this, we use sumarise_if to the data, where only the variables that are numeric are selected for the summary operation
data_sd <- pdata %>%
summarise_if(is.numeric, funs(sd))
Here is the output from the above operation
| code | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 15.48104 | 0.4026766 | 3453545 | 1264989 | 1204120 | 107464.1 | 986242.4 | 746178.5 | 243966 | 17692.89 | 0.9399385 |
We can represent the data presented by summary in a graphical format using a boxplot. We use the ggplot2 package to create the box plot. The box plot can be plotted for the numerical columns only; hence in the aes call, we map the type of avocado to x and the average price to y.
Inside the geom_boxplot call, we specify the color, shape and size of the outliers. The theme_classic function gives a similar theme to the base plot function.
ggplot(pdata, aes(x = type, y = AveragePrice, fill = type))+
geom_boxplot(outlier.color = "#192812", outlier.shape = 16, outlier.size = 2)+
scale_fill_manual(values = c("#2b501c", "#8cb575"))+
theme_classic()+
theme(legend.position="bottom")
This box plot shows the median, first quartile, and third quartile values for all the variables. The outliers can also be shown in the dataset. In the following boxplot, the outliers have been enabled:
We can see from the box plot above that organic avocados are more expensive on average than conventional avocados. Moreover, the outliers of the organic avocado are more extreme, showing that the price of the organic avocado can swing more violently than conventional avocados.
A density plot shows the distribution of the underlying data. Here we use a density plot to show the distribution of the sales volume of the types of avocado. Note that we transform the data by applying a log scale to the x data, which is volume, to circumvent large outliers from distorting the plot. This will allow us to see which type of avocado does more sales volume.
pdata %>%
ggplot(aes(x=`Total Volume`, fill=type))+
scale_fill_manual(values=c("#2b501c", "#8cb575"))+
scale_x_log10()+
geom_density(alpha=0.7)+
theme_classic()+
theme(legend.position="bottom")
As can be seen from the plot below, the conventional avocado has a higher sales volume than the organic avocado. Is this because of the price?
Let’s look at how a few different variables relate to each other, e.g: Average price and volume. The law of demand and supply states that as supply increases, price reduces. Is this the case with Avocado? Let us find out.
pdata %>%
ggplot(aes(`Total Volume`, AveragePrice, col=type)) +
geom_point() +
geom_smooth(method="lm")+
scale_x_log10()+
scale_color_manual(values = c("#2b501c", "#8cb575"))+
theme_classic()+
theme(legend.position="bottom")
The output from the code above is shown below:
As you can see, an increase in the volume of avocado tends to reduce the price.
We will use a bar chart to show the monthly price swings of avocados. We will filter the regions to the total US region. The code below creates a new column which shows the percentage change in the price.
growth.mon <- pdata %>%
filter(region == "TotalUS") %>%
select(Date, AveragePrice, type) %>%
arrange(type, Date) %>%
group_by(type) %>%
mutate(Rate = (AveragePrice - lag(AveragePrice)) / lag(AveragePrice),
Date = ymd(Date)) %>%
select(-AveragePrice)
growth.mon[is.na(growth.mon)] <- 0
The code below uses multiple calls to geom_rect with different fills to create a striped background.
ggplot(growth.mon, aes(x = Date, y = Rate))+
geom_rect(ymin = 0.8, ymax = Inf,
xmin = 0, xmax = 1000000, fill = '#f5f6f9')+
geom_rect(ymin = 0.6, ymax = 0.8,
xmin = 0, xmax = 1000000, fill = '#fbfcfc') +
geom_rect(ymin = 0.4, ymax = 0.6,
xmin = 0, xmax = 1000000, fill = '#f5f6f9')+
geom_rect(ymin = 0.2, ymax = 0.4,
xmin = 0, xmax = 1000000, fill = '#fbfcfc')+
geom_rect(ymin = 0, ymax = 0.2,
xmin = 0, xmax = 1000000, fill = '#f5f6f9')+
geom_rect(ymin = -0.2, ymax = 0,
xmin = 0, xmax = 1000000, fill = '#fbfcfc')+
geom_rect(ymin = -0.4, ymax = -0.2,
xmin = 0, xmax = 1000000, fill = '#f5f6f9')+
geom_rect(ymin = -0.6, ymax = -0.4,
xmin = 0, xmax = 1000000, fill = '#fbfcfc')+
geom_bar(stat = "identity", aes(fill = type),
show.legend = TRUE, position = "dodge")+
# scale_fill_continuous(low = "#0095db")+
scale_fill_manual(name = "Change in average price", values = c("#2b501c", "#43362e"))+
scale_y_continuous(NULL, labels = percent_format(), limits = c(min(growth.mon$Rate), max(growth.mon$Rate)))+
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")+
theme(axis.text.x=element_text(angle=60, hjust=1))+
theme(legend.position = "top",
legend.background = element_rect(color = NA),
legend.key = element_rect(fill = NA, color = NA))
The output of the code above is shown below:
The graph shows that from the tail end of 2017, price changes of the conventional type of avocado have been more severe than price changes of the organic type. What could have caused this?
Finally, we look at the progression of the average avocado price.
pdata %>%
mutate(Date = ymd(Date))%>%
ggplot(aes(x=Date, y=AveragePrice, color=type))+
geom_line()+
theme_classic()+
scale_color_manual(values = c("#2b501c", "#ed6325"))+
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y",expand=c(0,0))+
theme(axis.text.x=element_text(angle=60, hjust=1))+
theme(legend.position = "top")
Output:
Looks like the average price of the organic avocado has a wider range than the conventional avocado.
A linear model is a way of mathematically representing a relationship between two variables. This relationship can be shown to be statistically significant, in which case the relationship is likely to be real or statistically insignificant, in which case the relationship is virtually non-existent.
Performing a linear regression in R is remarkably simple, since R is equipped with the easy to use lm function. In this section, we will be fitting linear models to the relationships we visualised in the previous section.
The most interesting observation we made in the previous section was that the price of avocados goes down as volume sold goes up. We will now test if this relationship is statistically significant.
conmod <- pdata %>%
filter(type=="conventional") %>%
mutate(Volume = log(`Total Volume`)) %>%
lm(AveragePrice ~ Volume, data = .)
orgmod <- pdata %>%
filter(type=="organic") %>%
mutate(Volume = log(`Total Volume`)) %>%
lm(AveragePrice ~ Volume, data = .)
We save the model in order to display the results in a table, rather than printing to the console. To this, we use the tab_model function from the sjPlot package.
tab_model(conmod, orgmod)
The code chunk outputs the table below:
tab_model(conmod, orgmod)
| Â | AveragePrice | AveragePrice | ||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 1.75 | 1.70 – 1.81 | <0.001 | 2.07 | 2.02 – 2.12 | <0.001 |
| Volume | -0.05 | -0.05 – -0.04 | <0.001 | -0.04 | -0.05 – -0.04 | <0.001 |
| Observations | 9126 | 9123 | ||||
| R2 / R2 adjusted | 0.053 / 0.053 | 0.030 / 0.030 | ||||
The results show that indeed, the average price of avocado decreases as volume increases. The p-value is shows that the negative coefficient is statistically significant.
We examine if the fitted model is any good. A quick way to do this is with autoplot and ggfortify installed. This will give a ggplot2 version of the regression diagnostics plot from base R.
autoplot(conmod)
autoplot(orgmod)
The output is shown below
autoplot(conmod)
autoplot(orgmod)
We can see that the model roughly follows a normal distribution as the deviations are not too steep. Furthermore, the standard deviation of the residuals does not exceed 3 and are mostly below 2.
We can therefore conclude that the price of avocado drops as volume increases.